4  Low dimensional visualizations

4.1 Why plotting?

Plotting is crucial to data science because:

  • It facilitates making new observations by discovering associations or patterns in the data (the initial step of the scientific method 1). The human brain, which evolved in a visual environment, is particularly good at detecting patterns in images. A visual display is therefore much more effective than staring at tables of numbers.

  • It facilitates communicating findings.

  • Relying only on summary statistics (mean, correlation, etc.) is dangerous. Summary statistics reduce the data to single numbers and therefore carry much less information than two-dimensional representations. Section 4.1.1 provides examples.

  • It helps debugging: either the code, by visually checking whether particular operations performed as expected on the data, or the data itself, by identifying “bugs in the data” such as wrong entries or outliers. Section 4.1.2 provides an example.

4.1.1 Plotting versus summary statistics

What do those 13 datasets have in common?

All of these datasets, including the infamous datasaurus, share the same summary statistics:

  • X mean: 52.26
  • Y mean: 47.83
  • X standard deviation: 16.76
  • Y standard deviation: 29.93
  • Pearson correlation: -0.06

Looking only at the statistics, we would probably have wrongly concluded that the datasets are identical. This example highlights why it is important to visualize data and not just rely on summary statistics. See https://github.com/lockedata/datasauRus or Anscombe’s quartet https://en.wikipedia.org/wiki/Anscombe%27s_quartet for more examples.
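The effect is easy to reproduce. As a sketch, the first two datasets of Anscombe's quartet (values from the published quartet) have, to two decimals, identical means, standard deviations and correlations, yet one is roughly linear in x and the other a parabola:

```python
import numpy as np

# First two datasets of Anscombe's quartet: visually very different,
# yet nearly identical summary statistics
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    # mean, sample SD and Pearson correlation, rounded to 2 decimals
    print(round(y.mean(), 2), round(y.std(ddof=1), 2),
          round(np.corrcoef(x, y)[0, 1], 2))
# Both lines print the same numbers, although the shapes differ completely.
```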

4.1.2 Plotting helps find bugs in the data

Consider the following data frame height_df whose column height contains (hypothetical) height measurements in meters for 500 adults:

height_df.head()
height
0 1.81
1 1.75
2 1.77
3 1.83
4 1.81

Calculating the mean height returns the following output:

height_df['height'].mean()
np.float64(2.05558)

There is something obviously wrong. We can plot the data to investigate.

(ggplot(height_df, aes('height')) + geom_histogram(bins=50) + mytheme)

There is an outlier (height=165). One particular value has probably been entered in centimeters rather than meters. As a result, the mean is inflated.
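The arithmetic behind this inflation is simple. A hypothetical sketch with one mis-entered value among 500 (the numbers are illustrative, not the book's actual dataset):

```python
import numpy as np

# Hypothetical data: 500 heights of ~1.73 m, one entered as 165 (cm)
heights = np.full(500, 1.73)
heights[0] = 165.0  # 1.65 m mistyped in centimeters

# A single bad entry shifts the mean by (165 - 1.73) / 500 ≈ 0.33 m
print(round(heights.mean(), 3))  # 2.057
```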

A quick way to fix our dataset is to remove the outlier, for instance with:

height_df = height_df[height_df['height'] < 3]
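If we are confident the outlier is a unit error rather than a wrong measurement, an alternative to dropping it is to convert it back to meters. A sketch on toy data, reusing the same threshold-of-3 heuristic:

```python
import pandas as pd

# Toy data standing in for the real height_df
height_df = pd.DataFrame({'height': [1.81, 1.75, 165.0]})

# Assume values above 3 are in centimeters; convert them to meters.
# Only do this if the unit error is certain.
cm_rows = height_df['height'] > 3
height_df.loc[cm_rows, 'height'] /= 100

print(height_df['height'].tolist())  # [1.81, 1.75, 1.65]
```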

Now our plotted data seems more realistic and the mean height makes sense.

height_df['height'].mean()
np.float64(1.7290380761523045)
(ggplot(height_df, aes('height')) + geom_histogram() + mytheme)

While developing analysis scripts, we recommend frequently visualizing the data to make sure no mistake occurred in the input or during processing.

4.2 Grammar of graphics

The grammar of graphics is a visualization theory developed by Leland Wilkinson in 1999. It has influenced the development of many graphics and visualization libraries. It is based on three key principles:

  • Separation of data from aesthetics (e.g. x and y-axis, color-coding)
  • Definition of common plot/chart elements (e.g. scatter plots, box-plots, etc.)
  • Composition of these common elements (one can combine elements as layers)

The library plotnine is a Python implementation inspired by ggplot2 and follows the grammar of graphics closely.

Here is a sophisticated motivating example. The plot shows the relationship between per-capita gross domestic product (GDP) and life expectancy at birth for the years 1977 and 2007 from the dataset gapminder:

gapminder = pd.read_csv("https://raw.github.com/jennybc/gapminder/refs/heads/main/inst/extdata/gapminder.tsv", sep="\t")
gm_df = gapminder[gapminder['year'].isin([1977, 2007])]
(
    ggplot(gm_df, aes(x='gdpPercap', y='lifeExp', color='continent', size='pop'))
    + geom_point()
    + scale_x_log10()
    + facet_grid('~year')
    + labs(x='per-capita GDP', y='Life expectancy at birth', size = 'Population')
    + mytheme  + theme(figure_size=(6,3))
)
Figure 4.1

We may, for instance, use such visualization to find differences in the life expectancy of each country and each continent.

The following section shows how to create such a sophisticated plot step by step.

4.2.1 Components of the layered grammar

The grammar of graphics composes plots by combining layers. The major layers are:

  • Always used:

    Data: a pandas.DataFrame where columns correspond to variables

    Aesthetics: mapping of data to visual characteristics - what we will see on the plot (aes) — position (x,y), color, size, shape, transparency

    Geometric objects: geometric representation defining the type of the plot data (geom_) — points, lines, boxplots, …

  • Often used:

    Scales: for each aesthetic, describes how a visual characteristic is converted to display values (scale_) — log scales, color scales, size scales, …

    Facets: describes how data is split into subsets and displayed as multiple sub graphs (facet_)

  • Useful, but with care:

    Stats: statistical transformations that typically summarize data (stat_) — counts, means, medians, regression lines, …

  • Domain-specific usage:

    Coordinate system: describes 2D space that data is projected onto (coord_) — Cartesian coordinates, polar coordinates, map projections, …

4.2.2 Defining the data and layers

To demonstrate the application of the grammar of graphics, we will build the gapminder figure (Figure 4.1) step by step. First, we have a look at the first lines of the dataset:

gm_df[['country','continent','gdpPercap','lifeExp','year']].head()
country continent gdpPercap lifeExp year
5 Afghanistan Asia 786.113360 38.438 1977
11 Afghanistan Asia 974.580338 43.828 2007
17 Albania Europe 3533.003910 68.930 1977
23 Albania Europe 5937.029526 76.423 2007
29 Algeria Africa 4910.416756 58.014 1977

To start the visualization, we initialize a ggplot object, which generates an empty plot with only a background:

(ggplot(gm_df))

Next, we define the data to be plotted, which needs to be a pandas.DataFrame, together with the aes() function. The aes() function defines which columns map to the x and y coordinates and whether points should be colored or given different shapes and sizes based on the values of another column. These mappings are called “aesthetics”.

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp')))

We want to visualize the data with a simple scatter plot. In a scatter plot, the values of two variables are plotted along two axes, and each pair of values is represented as a point. We add the function geom_point() to create the scatter plot:

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp'))
  + geom_point())

One of the advantages of plotting with grammar-of-graphics libraries is that the plot object can be stored and modified. For example:

p = ggplot(gm_df, aes(x='gdpPercap', y='lifeExp')) + geom_point()
print(type(p))
<class 'plotnine.ggplot.ggplot'>

4.2.3 Mapping of aesthetics

Mapping of color, shape and size

We can easily map variables to different colors, sizes or shapes depending on the value of the specified variable. To assign each point to its corresponding continent, we can define the variable continent as the color attribute in aes():

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp', color='continent')) + geom_point())

Instead of color, we can also use different shapes for characterizing categories (note: some backends limit number of shapes):

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp', shape='continent')) + geom_point())

Additionally, we distinguish the population of each country by giving a size to the points in the scatter plot:

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp', color='continent', size='pop')) + geom_point())

Global versus individual mapping

Mapping of aesthetics in aes() can be done globally or at individual layers. Global mappings are inherited by all geom layers by default, while a mapping at an individual layer applies only to that layer. Example:

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp')) + geom_point(aes(color='continent', size='pop')))

Note that aesthetics mapped at an individual layer are not visible to other layers. For instance, we can add another layer for smoothing with stat_smooth():

# this doesn't work because stat_smooth does not know aes(x, y)
(ggplot(gm_df) +
  geom_point(aes(x='gdpPercap', y='lifeExp')) +
  stat_smooth(color='blue'))
# this works but is redundant
(ggplot(gm_df) +
  geom_point(aes(x='gdpPercap', y='lifeExp')) +
  stat_smooth(aes(x='gdpPercap', y='lifeExp'), color='blue'))

# the common aes(x, y) shared by all the layers can be put in the ggplot()
(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp')) +
  geom_point() +
  stat_smooth(color='blue'))

4.2.4 Facets, axes and labels

For comparing the data from different years, we can add a facet with facet_wrap():

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp', color='continent', size='pop'))
   + geom_point() 
   + facet_wrap('~year')
   + theme(figure_size=(6,3)))

For a better visualization of the data points, we can consider log scaling (detailed in Section 4.3.4.1.2). Finally, we can adapt the axis labels of the plot with labs() and define a theme for our plot:

mysize = 9
# apply the complete theme first; the theme() overrides must come second,
# otherwise theme_minimal() would replace them
mytheme = theme_minimal(base_size=mysize) + theme(
    axis_title = element_text(size=mysize), 
    axis_text = element_text(size=mysize),
    legend_title = element_text(size=mysize),
    legend_text = element_text(size=mysize),
    )

(ggplot(gm_df, aes(x='gdpPercap', y='lifeExp'))
    + geom_point(aes(color='continent', size='pop'))
    + facet_grid('~year')
    + scale_x_log10()
    + labs(x='Per-capita GDP', y='Life expectancy at birth', size='Population')
    + mytheme + theme(figure_size=(6,3)))

4.3 Different types of one- and two-dimensional plots

In the previous examples, we had a look at scatter plots which are suitable for plotting the relationship between two continuous variables. However, there are many more types of plots (e.g. histograms, boxplots) which can be used for plotting in different scenarios. Mainly, we distinguish between plotting one or two variables and whether the variables are continuous or discrete.

4.3.1 Plots for one single continuous variable

Histograms

A histogram represents the frequencies of values of a variable bucketed into ranges or bins. It takes as input numeric variables only. The height of a bar for a given range in a histogram represents the number of values present in that bin.

We will make use of a dataset collecting Human Development Index (HDI from http://hdr.undp.org/) and Corruption Perception Index (CPI from http://www.transparency.org/) of various countries. We first load these data into a new data table ind and have a first look at the table:

ind = pd.read_csv('../../extdata/CPI_HDI.csv').drop(columns=['Unnamed: 0'])
ind.head()
country wbcode CPI HDI region
0 Afghanistan AFG 12 0.465 Asia Pacific
1 Albania ALB 33 0.733 East EU Cemt Asia
2 Algeria DZA 36 0.736 MENA
3 Angola AGO 19 0.532 SSA
4 Argentina ARG 34 0.836 Americas
(ggplot(ind, aes('HDI')) + geom_histogram() + mytheme)

We can change the desired number of bins 2 via the bins argument of the geom_histogram() function:

(ggplot(ind, aes('HDI')) + geom_histogram(bins=10) + mytheme)

Density plots

In some situations, histograms are not the best choice to investigate the distribution of a variable, due to discretization effects of the binning process. An alternative is the density plot, which also represents the distribution of a numeric variable. Density plots are typically obtained by kernel density estimation, which smooths out the noise. The resulting curve is smooth and does not depend on a choice of bins, which often gives a better-defined picture of the distribution’s shape.
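Under the hood, a kernel density estimate places a small Gaussian bump at every data point and sums them. A sketch with scipy.stats.gaussian_kde (not necessarily the exact estimator plotnine uses, but the same idea):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(size=500)

kde = gaussian_kde(sample)        # bandwidth chosen by a default rule
grid = np.linspace(-5, 5, 1001)
density = kde(grid)               # smooth estimate, no bins involved

# A valid density: non-negative and integrating to ~1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 2))
```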

As an example, we can visualize the distribution of the Human Development Index (HDI) in the ind dataset by means of a density plot with geom_density():

(ggplot(ind, aes('HDI')) + geom_density() + mytheme)

The bw argument of the geom_density() function allows tweaking the bandwidth of the density estimate manually. The default is a bandwidth selection rule, which is usually a good choice.

Setting a small bandwidth on the previous plot has a huge impact on the plot:

(ggplot(ind, aes('HDI')) + geom_density(bw=0.01) + ggtitle('Small bandwidth') + mytheme)

Setting a large bandwidth has also a huge impact on the plot:

(ggplot(ind, aes('HDI')) + geom_density(bw=1) + ggtitle('Large bandwidth') + mytheme)

Thus, we should be careful when changing the bandwidth, since we can get a wrong impression from the distribution of a continuous variable.

Boxplots

Boxplots can give a good graphical insight into the distribution of the data. They show the median, quartiles, and how far the extreme values are from most of the data.

Four values are essential for constructing a boxplot:

  • the median
  • the first quartile (Q1)
  • the third quartile (Q3)
  • the interquartile range (IQR): the difference between Q3 and Q1

See http://web.pdx.edu/~stipakb/download/PA551/boxplot.html for a good illustration.

Every boxplot has lines at Q1, the median, and Q3, which together form the box. The other major feature of a boxplot is its whiskers, which are determined using the IQR. Data points outside the interval [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR] are called outliers. The whiskers are drawn to the smallest and largest data points inside this interval, i.e. to the most extreme data points within \(\pm 1.5\times IQR\) of the box.
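These definitions can be computed directly. A small sketch with numpy on toy data:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

q1, q3 = np.quantile(data, [0.25, 0.75])   # 3.5 and 8.5 here
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers reach the most extreme points inside the fences;
# everything outside the fences is flagged as an outlier
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whiskers = (inside.min(), inside.max())
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(whiskers, outliers)  # whiskers at 1 and 10, one outlier: 100
```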

It is possible to not show the outliers in boxplots. However, we strongly recommend keeping them. Outliers can reveal interesting data points (discoveries “out of the box”) or bugs in data preprocessing.

For instance, we can plot the distribution of a variable x with a histogram and visualize the corresponding boxplot:

Boxplots are particularly suited for plotting non-Gaussian symmetric and non-symmetric data and for plotting exponentially distributed data. However, boxplots are not well suited for bimodal data, since they only show one mode (the median). In the following example, we see a bimodal distribution in the histogram and the corresponding boxplot, which does not properly represent the distribution of the data.
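We can see numerically why the median misrepresents bimodal data. A sketch with two well-separated modes, where the median falls in a region containing almost no observations:

```python
import numpy as np

rng = np.random.default_rng(1)
# Bimodal sample: two well-separated modes at -3 and +3
sample = np.concatenate([rng.normal(-3, 0.5, 500),
                         rng.normal(3, 0.5, 500)])

med = np.median(sample)
# Fraction of observations within +/- 1 of the median
frac_near_med = np.mean(np.abs(sample - med) < 1)

# The median lies between the modes, where there is hardly any data
print(round(med, 2), round(frac_near_med, 3))
```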

Boxplots are also not suited for categorical data and discrete data with very few values, for which bar plots are preferred (Section 4.3.3.1).

4.3.2 Assessing distributional assumptions with Q-Q Plots

As we will see in upcoming chapters, several tests assume that the data follows a particular distribution. We will now explore a plot which we can use to check whether such an assumption is reasonable.

Limitations of Histograms

We already know a plot which can be used to visualize distributions, namely the histogram, and we might think it could be used to check distributional assumptions. However, this is complicated by the difficulty of choosing the right bin size. Consider, for example, the following histogram of a sample drawn from the uniform distribution on the interval [0, 1]:

x = np.random.rand(50)
(ggplot(pd.DataFrame({'x': x}), aes(x='x'))
  + geom_histogram(bins=20)
  + mytheme)

Just looking at the histogram, it is hard to see that the underlying data comes from the uniform distribution.

Q-Q plots: Comparing empirical to theoretical quantiles

What could be a better approach here? One thing we can do is look at the quantiles.

The basic idea here is as follows: if the data actually follows a uniform distribution on the interval 0 to 1, then we expect 10% of the data in the interval [0,0.1], 20% in the interval [0,0.2], and so on…
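This expectation can be checked by direct counting: for a large uniform sample, the fraction of values below a threshold \(t\) should be close to \(t\). A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100_000)  # uniform draws on [0, 1]

for t in (0.1, 0.2, 0.5):
    # the empirical fraction below t should be close to t
    print(t, round(np.mean(u < t), 3))
```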

We can now check whether our data conforms to this expectation:

dec = np.quantile(x, np.arange(0, 1.1, 0.1))
dec
array([0.00549482, 0.07751476, 0.24225082, 0.29552267, 0.37907005,
       0.50416999, 0.60034441, 0.70164272, 0.80973681, 0.9195869 ,
       0.99490191])

Here we implicitly chose to always make jumps of \(10\%\). These quantiles are therefore called deciles.

The package scipy (Scientific Python) provides statistical methods in Python; it builds on numpy. For every implemented distribution, scipy.stats offers random draws (rvs), the density (pdf), the cumulative distribution function (cdf), and its inverse (ppf), which allows you to get quantiles, e.g. norm.rvs(), uniform.rvs(), etc. Note that sample quantiles are, in general, not unbiased estimators.
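For example, ppf inverts cdf, which makes theoretical quantiles easy to obtain:

```python
from scipy.stats import norm, uniform

q = norm.ppf(0.975)   # 97.5% quantile of the standard normal, ~1.96
p = norm.cdf(q)       # applying cdf recovers the probability

print(round(q, 2), round(p, 3))  # 1.96 0.975

# For the standard uniform, quantiles equal the probabilities
print(uniform.ppf(0.3))  # 0.3
```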

We can make a scatter plot which compares the empirical deciles of our data to the theoretical deciles:

(ggplot(pd.DataFrame({'x': np.arange(0,1.1,0.1), 'y':dec}), aes(x='x',y='y'))
  + geom_point()
  + xlim(0,1)
  + ylim(0,1)
  + xlab('Deciles of the uniform distribution')
  + ylab('Deciles of the dataset')
  + geom_abline(intercept=0, slope=1) # Add y=x line
  + mytheme)

We see that they match quite well.

For a finite sample we can estimate the quantile for every data point. One way is to use \((r-0.5)/N\) as the expected quantile (Hazen, 1914), where \(r\) is the rank of the data point and \(N\) the sample size.

(ggplot(pd.DataFrame({'x': np.sort(x),
                      'y': (np.arange(1, len(x)+1) - 0.5) / len(x)}), aes(x='y', y='x'))
  + geom_point()
  + xlim(0, 1)
  + ylim(0, 1)
  + xlab('Quantiles of the uniform distribution')
  + ylab('Quantiles of the dataset')
  + geom_abline(intercept=0, slope=1)
  + mytheme)

This is called a Q-Q plot, which is short for Quantile-Quantile plot. When the distribution matches the data, as above, the points should be close to the diagonal.

Typical Q-Q plots

Figure 4.2 gives more examples. We assume here the Normal distribution (Gaussian with mean 0 and variance 1) as reference theoretical distribution. These plots show how different violations of the distributional assumption translate to different deviations from the diagonal in a Q-Q plot.

Figure 4.2: Examples of Q-Q plots. The theoretical distribution is in each case the Normal distribution (Gaussian with mean 0 and variance 1). The upper row shows histograms of some observations, the lower row shows the matching Q-Q plots. The vertical red dashed line marks the theoretical mean (0, top row) and the red lines the y=x diagonal (bottom row).

The middle three plots show what happens when one particular aspect of the distributional assumption is incorrect. The second from the left shows what happens if the data has a mean higher than we expected, but otherwise follows the distribution. The middle one shows what happens if the data has fatter tails (i.e. more outliers) than we expected - this occurs frequently in practice. The second from the right shows what happens if the distribution is narrower than expected. The last plot shows a combination of these phenomena. There the data come from a non-negative asymmetric distribution 3. The Q-Q plot shows a lack of low values (capped at 0) and an excess of high values.

4.3.3 Plots for two variables: one continuous, one discrete

Barplots

Barplots are often used to highlight individual quantitative values per category. Bars are visual heavyweights compared to dots and lines. In a barplot, two visual attributes, 2-D position and bar length, combine to encode quantitative values. This focuses attention primarily on individual values and supports comparing one value to another.

For creating a barplot with plotnine we can use the function geom_bar(). In the next example, we visualize the number of countries (defined in the y axis) per continent (defined in the x axis).

countries_dt = pd.DataFrame({
    'Continent': ["North America", "South America", "Africa", "Asia", "Europe", "Oceania"],
    'Number_countries': [23, 12, 54, 49, 50, 16]
})

(ggplot(countries_dt, aes('Continent', 'Number_countries')) + geom_bar(stat='identity', width=0.7)
    + mytheme + theme(figure_size=(6,3)))

Barplots with errorbars

Visualizing uncertainty is important; otherwise, barplots whose bars result from an aggregation can be misleading. One way to visualize uncertainty is with error bars.

As error bars, we can consider the standard deviation (SD) or the standard error of the mean (SEM). SD and SEM are related yet different concepts: SD indicates the variation of the quantity in the sample, while SEM represents how well the mean is estimated.

For \(n\) independent observations, \(SEM = SD / \sqrt{n}\), where \(n\) is the sample size, i.e. the number of observations. With large \(n\), SEM tends to 0, i.e. our uncertainty about the distribution’s expected value decreases with larger sample sizes. In contrast, SD converges with large sample size to the distribution’s standard deviation.
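The different behaviors of SD and SEM are easy to simulate. A sketch with normal data of known standard deviation 2:

```python
import numpy as np

rng = np.random.default_rng(0)

sds, sems = [], []
for n in (10, 1_000, 100_000):
    sample = rng.normal(0, 2.0, n)
    sd = sample.std(ddof=1)
    sds.append(sd)
    sems.append(sd / np.sqrt(n))  # SEM = SD / sqrt(n)

# SD stabilizes around the true value 2, while SEM shrinks toward 0
print([round(s, 2) for s in sds], [round(s, 4) for s in sems])
```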

In the following example (Figure 4.3), we plot the average highway miles per gallon hwy per vehicle class class, including error bars drawn at the average plus/minus one standard deviation of hwy. Because various kinds of error bars are used in the literature, it is recommended to always specify in the figure legend what the error bars represent.

from plotnine.data import mpg
mpg_df = mpg.copy()
summary = mpg_df.groupby('class').agg(mean_hwy=('hwy', 'mean'), sd_hwy=('hwy', 'std')).reset_index()
summary['ymax'] = summary['mean_hwy'] + summary['sd_hwy']
summary['ymin'] = summary['mean_hwy'] - summary['sd_hwy']

(ggplot(summary, aes('class', 'mean_hwy', ymax='ymax', ymin='ymin'))
    + geom_bar(stat='identity')
    + geom_errorbar(width=0.3)
    + mytheme + theme(figure_size=(6,3)))
Figure 4.3: Mean +/- standard deviation of highway miles per gallon per car class.

Boxplots by category

As illustrated before, boxplots are well suited for plotting one continuous variable. However, we can also use boxplots to show distributions of continuous variables with respect to some categories. This can be particularly interesting for comparing the different distributions of each category.

For instance, we want to visualize the highway miles per gallon hwy for every one of the 7 vehicle classes (compact, SUV, minivan, etc.). For this, we define the categorical class variable on the x axis and the continuous variable hwy on the y axis.

(ggplot(mpg_df, aes('class', 'hwy')) + geom_boxplot() + mytheme + theme(figure_size=(6,3)))

One can also use geom_jitter(), which adds some random noise to the x-position of the data points within each box in order to separate them visually.

p = ggplot(mpg_df, aes('class', 'hwy')) + geom_boxplot() + mytheme + theme(figure_size=(6,3))
(p + geom_jitter(width=0.2))

Violin plots

A violin plot is an alternative to the boxplot for visualizing one continuous variable (grouped by categories). An advantage of the violin plot over the boxplot is that it also shows the entire distribution of the data. This can be particularly interesting when dealing with multimodal data.

For a direct comparison, we show a violin plot for the hwy grouped by class as before with the help of the function geom_violin():

(ggplot(mpg_df, aes('class', 'hwy')) + geom_violin() + mytheme + theme(figure_size=(6,3)))

4.3.4 Plots for two continuous variables

Scatter plots

Scatter plots are a useful plot type for easily visualizing the relationship between two continuous variables. Here, dots are used to represent pairs of values corresponding to the two considered variables. The position of each dot on the horizontal (x) and vertical (y) axis indicates values for an individual data point.

In the next example, we analyze the relationship between the engine displacement in liters displ and the highway miles per gallon hwy from the mpg dataset:

(ggplot(mpg_df, aes('displ', 'hwy')) + geom_point() + mytheme)

We can modify the previous plot by coloring the points depending on the vehicle class:

(ggplot(mpg_df, aes('displ', 'hwy', color='class')) + geom_point() + mytheme)

Sometimes, too many colors can be hard to distinguish. In such cases, we can use faceting to separate the classes into different panels:

(ggplot(mpg_df, aes('displ', 'hwy')) + geom_point() + facet_wrap('~class')
   + mytheme + theme(figure_size=(6,4)))

Text labeling

For labeling the individual points in a scatter plot, plotnine offers the function geom_text(). However, these labels tend to overlap. To avoid this, we can use the argument adjust_text.

We first show the output of the classic text labeling with geom_text() for a random subset of 30 observations of the dataset mpg. Here we plot the engine displacement in liters displ vs. the highway miles per gallon hwy and label points by manufacturer:

mpg_subset = mpg_df.sample(n=30, random_state=12)
adjust_text_dict = {'arrowprops': {'arrowstyle': '-','color': 'grey'}}
(ggplot(mpg_subset, aes('displ', 'hwy', label='manufacturer')) + geom_point() + geom_text(adjust_text=adjust_text_dict) + mytheme)

Log scaling

We consider another example where we want to plot the weights of the brain and body of different animals using the dataset Animals. This is what we obtain after creating a scatterplot.

animals_df = pd.read_csv("https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/MASS/Animals.csv")
(ggplot(animals_df, aes('body', 'brain')) + geom_point() + mytheme)

We can clearly see that there are a few points which are notably larger than most of the points. This makes it harder to interpret the relationships between most of these points. In such cases, we can consider logarithmic transformations and/or scaling. More precisely, a first idea would be to manually transform the values into a logarithmic space and plot the transformed values instead of the original values:

animals_df["log_body"] = np.log10(animals_df["body"])
animals_df["log_brain"] = np.log10(animals_df["brain"])
(ggplot(animals_df, aes('log_body', 'log_brain')) + geom_point() + mytheme)

Alternatively, plotnine can scale the axes directly, without the need to transform the data. This can be done with the functions scale_x_log10() and scale_y_log10(), which take care of appropriate scaling and labeling of the axes:

(ggplot(animals_df, aes('body', 'brain')) + geom_point() + scale_x_log10() + scale_y_log10() + mytheme)

4.4 2D-Density plots

Using scatterplots can become problematic when dealing with a huge number of points. This is due to the fact that points may overlap and we cannot clearly see how many points are at a certain position. In such cases, a 2D density plot is particularly well suited. This plot counts the number of observations within a particular area of the 2D space.

The function geom_bin_2d() creates 2D density plots in Python:

x = np.random.randn(10000)
y = x + np.random.randn(10000)
df = pd.DataFrame({'x': x, 'y': y})
(ggplot(df, aes('x', 'y')) + geom_bin_2d() + mytheme)

Line plots

A line plot connects a series of individual data points and displays the trend of a series. This is particularly useful to show how data changes from point to point, and how strongly values move up and down over time.

As an example, we plot the unemployment-to-population ratio over the years. The economics dataset shipped with plotnine provides the unemployment counts (unemploy) and the total population (pop), from which we derive the ratio:

from plotnine.data import economics
economics = economics.assign(unemploy_pop_ratio=economics['unemploy'] / economics['pop'])
(ggplot(economics, aes(x='date', y='unemploy_pop_ratio'))
  + geom_line()
  + mytheme)

4.5 Further plots for low dimensional data

4.5.1 Plot matrix

A plot matrix is useful for exploring the distributions and correlations of a few variables in a matrix-like representation. Here, for each pair of considered variables, a scatterplot is shown. Moreover, a density plot is created for every single variable (diagonal).

We can use the function sns.pairplot() from the library seaborn for constructing plot matrices:

import seaborn as sns
mpg = sns.load_dataset('mpg')
columns_to_plot = ['displacement', 'cylinders', 'mpg', 'horsepower']
sns.pairplot(mpg, vars=columns_to_plot, kind='scatter', diag_kind='kde', aspect=2, height=1)

This plot is recommended and well suited for a handful of variables but does not scale to many more.

4.6 Summary

This chapter covered the basics of the grammar of graphics and plotnine to plot low dimensional data. We introduced the different types of plots such as histograms, boxplots or barplots and discussed when to use which plot.

4.7 Resources

  • The ggplot book: https://ggplot2-book.org/

  • The plotnine guide: https://plotnine.org/reference/

  • Udacity’s Data Visualization and D3.js

    • https://www.udacity.com/courses/all
  • Graphics principles

    • https://onlinelibrary.wiley.com/doi/full/10.1002/pst.1912
    • https://graphicsprinciples.github.io/

  1. https://en.wikipedia.org/wiki/Scientific_method↩︎

  2. There does not seem to be any widely accepted heuristics to choose the number of bins of a histogram based on data. We recommend trying different values if the default seems to be suboptimal.↩︎

  3. simulated with the Negative binomial distribution np.random.negative_binomial(n=2, p=(2/102), num_samples)/100.↩︎